
Conversation

fmiao2372

FastDeploy now supports the ERNIE 4.5 model on Intel HPU.

Dependencies:
Gaudi software: 1.22.0
PaddlePaddle: 3.1.1
PaddleCustomDevice: latest develop branch

Support for more models and further performance optimizations will follow.
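
Not part of the PR itself, but a minimal sketch of how the environment could be sanity-checked before serving, assuming the custom device registered by PaddleCustomDevice is named "intel_hpu" (the device name is an assumption):

import paddle

# List the custom device types registered by PaddleCustomDevice plugins.
custom_devices = paddle.device.get_all_custom_device_type()
assert "intel_hpu" in custom_devices, f"Intel HPU plugin not found, got: {custom_devices}"

# Select the first HPU card; the dependency list above expects PaddlePaddle 3.1.1.
paddle.set_device("intel_hpu:0")
print(paddle.__version__)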


paddle-bot bot commented Sep 17, 2025

Thanks for your contribution!

@paddle-bot added the contributor (External developers) label on Sep 17, 2025
@fmiao2372 force-pushed the integration_upstreaming branch from 7e59562 to d7509a6 on September 17, 2025 at 12:49
try:
    # assert len(paddle.static.cuda_places()) > 0
    return True
except Exception as e:
Collaborator

This check doesn't seem to work.

Author

fixed
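
Not the actual fix from this PR, but a minimal sketch of what a working check could look like, assuming the goal is to detect the intel_hpu custom device rather than CUDA places:

import paddle

def is_available() -> bool:
    # The original try block always returned True because the assert was
    # commented out; querying the registered custom device types instead
    # makes the check meaningful.
    try:
        return "intel_hpu" in paddle.device.get_all_custom_device_type()
    except Exception:
        return False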

# PACKAGE = "fastdeploy.model_executor.ops.intel_hpu"
PACKAGE = "paddlenlp_ops"

import_custom_ops(PACKAGE, "paddlenlp_ops", globals())
Collaborator

Should this be fastdeploy.model_executor.ops.intel_hpu instead of paddlenlp_ops?

Is this because of the naming convention of the ops implementation in custom device?

Author

Yes, the real custom ops come from PaddleCustomDevice; we just rename them in FastDeploy.
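
For illustration, a rough sketch of the re-export idea (the helper below is hypothetical, not the actual import_custom_ops from FastDeploy): the kernels are compiled and registered by PaddleCustomDevice under the paddlenlp_ops package, and FastDeploy only republishes them under its own module namespace.

import importlib

def reexport_custom_ops(source_package: str, target_globals: dict) -> None:
    # Copy every public symbol from the custom-op package into the caller's
    # namespace so FastDeploy code can call the ops under its own module path.
    module = importlib.import_module(source_package)
    for name in dir(module):
        if not name.startswith("_"):
            target_globals[name] = getattr(module, name)

# Rough equivalent of the quoted call, pulling paddlenlp_ops symbols into
# fastdeploy.model_executor.ops.intel_hpu:
# reexport_custom_ops("paddlenlp_ops", globals())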

@@ -0,0 +1,21 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
Collaborator

2024->2025

Author

fixed

raise NotImplementedError


class AttentionBackend_HPU(AttentionBackend):
Collaborator

Would it be better to move this class to fastdeploy/model_executor/layers/attention/hpu_attn_backend.py?

Author

moved

"--enable-tensor-or-expert-parallel",
action='store_true',
default=EngineArgs.enable_tensor_or_expert_parallel,
help="Enable tensor parallelism for non-MoE and expert parallelism for MoE.")
Collaborator

Could we enable TP + EP by setting --enable-expert-parallel and --tensor-parallel-size, without adding a new argument?

parallel_config.engine_worker_queue_port = parallel_config.engine_worker_queue_port[
parallel_config.local_data_parallel_id
]
Collaborator

All CI runs fail at this line: TypeError: 'int' object is not subscriptable. We need to solve this first and then see if there are any other problems.

Author

fixed
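
One possible shape of the fix, as a minimal sketch using the names from the quoted snippet (not necessarily the change that was actually committed): engine_worker_queue_port may be a single int or a per-data-parallel-rank list, so it should only be indexed when it is a sequence.

def resolve_worker_queue_port(engine_worker_queue_port, local_data_parallel_id: int) -> int:
    # Accept both a single int (the configuration that failed in CI) and a
    # per-data-parallel-rank list of ports.
    if isinstance(engine_worker_queue_port, (list, tuple)):
        return engine_worker_queue_port[local_data_parallel_id]
    return engine_worker_queue_port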

@@ -0,0 +1,314 @@
"""
Collaborator

There is a backends folder under the layers directory that holds the device-specific layer implementations; please move the attention and MoE implementations into that folder.

Author

Done as requested; they have been moved into the backends directory.

Comment on lines 121 to 122
elif current_platform.is_intel_hpu():
self.forward = self.forward_intel_hpu
Collaborator

The name forward_cuda is probably no longer the best fit at this point, but you should be able to reuse forward_cuda here since the logic is the same.

Author

Changed to reuse forward_cuda.

Comment on lines +212 to +213
elif current_platform.is_intel_hpu():
self.forward = self.forward_intel_hpu
Collaborator

How is this different from the other hardware platforms? Why does it need its own logic here? Can't it be abstracted into a few ops that then call forward_cuda?

Author

We currently use a fused implementation because it performs better on our platform; later we will consider splitting it up, provided performance is not affected.

from fastdeploy.platforms import current_platform


def reload_ep_checkpoint(model_path: str, fd_config: FDConfig, state_dict: dict, return_numpy: bool = False):
Collaborator

Why was the model-loading code modified? Is it because you are not using the official model?

Author (@fmiao2372, Sep 22, 2025)

The model itself was not modified; it is still the official model. The change is only to support model loading in TP+EP mode.

Author

In the TP+EP mode we support, the dense part uses TP, while the MoE part uses neither TP nor DP, only EP (with EP size equal to TP size). When the model is loaded, if TP is configured, the MoE weights are by default also sharded in TP fashion. What reload_ep_checkpoint does is first drop those TP-sharded MoE weights and then re-partition the full weights across the cards along the expert dimension.
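
A minimal sketch of that idea follows; the helper and the "experts.<idx>" naming pattern are assumptions for illustration, not the actual reload_ep_checkpoint code.

def repartition_experts(state_dict: dict, num_experts: int, ep_rank: int, ep_size: int) -> dict:
    # Each rank keeps a contiguous block of complete experts.
    experts_per_rank = num_experts // ep_size
    start = ep_rank * experts_per_rank
    end = start + experts_per_rank

    new_state = {}
    for name, weight in state_dict.items():
        if ".experts." in name:
            # Assumed naming: "...experts.<idx>...". Keep the weight only if
            # the expert index falls inside this rank's slice.
            expert_idx = int(name.split(".experts.")[1].split(".")[0])
            if start <= expert_idx < end:
                new_state[name] = weight
        else:
            # Dense (non-MoE) weights stay TP-sharded exactly as loaded.
            new_state[name] = weight
    return new_state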

self.expert_parallel_size = 1 # EP degree
self.data_parallel_size = 1 # DP degree
self.enable_expert_parallel = False
self.enable_tensor_or_expert_parallel = False
Collaborator

Can't this be inferred from a combination of fields such as enable_expert_parallel, expert_parallel_size, and tensor_parallel_size? Does a new field really have to be added to the user-facing interface?

Author

Currently EP is tied to DP in FastDeploy (EP size equals DP size), and the MoE layer forbids enabling TP and EP at the same time, so the cleanest way to support TP+EP is to add a new argument:
https://github.com/PaddlePaddle/FastDeploy/blob/develop/fastdeploy/model_executor/layers/moe/moe.py#L132-L134

Collaborator

Is the purpose of this argument to enable TP+EP parallelism for the MoE part?

Author

The dense part uses TP, and the MoE part uses EP (with EP size equal to TP size).

cache_cfg = CacheConfig(all_dict)
load_cfg = LoadConfig(all_dict)
parallel_cfg = ParallelConfig(all_dict)
cache_cfg.enc_dec_block_num = self.static_decode_blocks
Collaborator (@zoooo0820, Sep 19, 2025)

It would be better to set this value as in https://github.com/PaddlePaddle/FastDeploy/blob/release/2.2/fastdeploy/config.py#L899 to avoid impacting other hardware.

Author

This is not specific to one platform. It may actually be a bug: the static_decode_blocks parameter in EngineArgs cannot be passed through to cache_cfg even on GPUs, because CacheConfig has no static_decode_blocks field, only enc_dec_block_num.

Collaborator

It does look like there is a problem with static_decode_blocks not being passed to cache_cfg.

Could you please move the per-platform enc_dec_block_num setting into this file? Since this line runs only after cache_cfg is initialized, the default value of 2 may cause errors on other hardware, e.g. Iluvatar.

Author

After rebasing to the latest code, we can use FD_ENC_DEC_BLOCK_NUM to solve this problem, so I have removed this line.
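
As a usage sketch (assumed, based on the environment variable named above), the block count would then be injected before FastDeploy builds its CacheConfig, rather than by overwriting enc_dec_block_num afterwards:

import os

# Platform-specific value; must be set before the FastDeploy engine starts.
os.environ["FD_ENC_DEC_BLOCK_NUM"] = "2"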

else:
num_experts = model_config.moe_num_experts

num_experts_per_rank = num_experts // parallel_config.tensor_parallel_size
Collaborator

Why are the experts partitioned by tp_size?

Author

The current logic in FastDeploy is that if EP is enabled, the experts are partitioned by dp_size together with --enable-expert-parallel. Analogously, we can partition the experts by tp_size together with enable_tensor_or_expert_parallel to support the TP+EP mode (TP for the dense part, EP for the MoE part, with EP size equal to TP size).
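
A minimal sketch of how the divisor choice could look (enable_tensor_or_expert_parallel is the flag proposed and later removed in this PR, not an existing FastDeploy option):

def experts_per_rank(num_experts: int,
                     tensor_parallel_size: int,
                     data_parallel_size: int,
                     enable_tensor_or_expert_parallel: bool,
                     enable_expert_parallel: bool) -> int:
    # Proposed TP+EP mode: MoE experts are split by TP size (EP size == TP size).
    if enable_tensor_or_expert_parallel:
        ep_size = tensor_parallel_size
    # Existing FastDeploy behaviour: EP size is tied to DP size.
    elif enable_expert_parallel:
        ep_size = data_parallel_size
    else:
        ep_size = 1
    return num_experts // ep_size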

Collaborator

The name enable_tensor_or_expert_parallel does not feel very clear. For this dense-TP / MoE-EP split, could you follow the naming used by open-source frameworks such as vLLM or SGLang? As it stands, it is rather confusing.

Author

The naming is indeed problematic. We have removed the related code from this PR for now and will submit another PR once it has been refined.

@fmiao2372 force-pushed the integration_upstreaming branch from 5137adb to 2ae2c61 on September 22, 2025 at 07:54

Author (@fmiao2372)

@zoooo0820 @carryyu @YuanRisheng @gzy19990617, we have removed the TP+EP mode for now and will merge it separately after it has been refined.

@fmiao2372 force-pushed the integration_upstreaming branch from 2ae2c61 to e81e85a on September 22, 2025 at 08:14
@fmiao2372 force-pushed the integration_upstreaming branch from e81e85a to cdc1d07 on September 23, 2025 at 03:17
@zoooo0820 previously approved these changes on Sep 23, 2025

Collaborator (@zoooo0820) left a comment

LGTM



@dataclass
class ForwardMeta_HPU:
Collaborator (@yuanlehome, Sep 23, 2025)

Could the naming be kept consistent with the other hardware backends above, i.e. HPUForwardMeta?

Author

Renamed to HPUForwardMeta.
