sgl-kernel-npu/python/deep_ep/README.md at main · sgl-project/sgl-kernel-npu

DeepEP-Ascend

English | 中文

Introduction

DeepEP-Ascend is the Ascend NPU implementation of DeepEP, providing highly optimized Expert Parallelism (EP) communication kernels for Mixture-of-Experts (MoE) models on Ascend hardware. It supports two communication modes:

Normal Mode: High-throughput dispatch and combine operations for training and prefill phases.
Low-Latency Mode: Optimized for production inference with small batch sizes, achieving sub-150us latency.

DeepEP-Ascend uses a strategy-based architecture that allows flexible selection of communication implementations via environment variables, supporting various hardware topologies (A2, A3, A5) and communication backends (HCCS, RDMA, AlltoAll).

Software and Hardware

Supported Hardware Models: Atlas A2, A3 (support CANN 8.5 and CANN 9.0), and Atlas A5 (only supports CANN 9.0). Platform: aarch64/x86

Supporting Software:

Driver Ascend HDK 25.1.RC1.1, CANN Community Edition 8.5.0 and later versions (refer to the CANN Software Installation Guide to install the CANN development kit package, as well as the supporting firmware and drivers)
Before installing CANN software, you need to install the relevant dependency list
Python >= 3.9, Recommendation: Python 3.11
PyTorch >= 2.8.0, torch-npu >= 2.8.0

Quick Start

DeepEP-Ascend supports A2, A3 and A5 and needs to generate packages separately on each platform.

Compile and Build

Prepare the CANN environment variables (modify according to the installation path)

source /usr/local/Ascend/ascend-toolkit/set_env.sh

Build the project

A5
```
bash build.sh -a deepep Ascend950
```
A3
```
bash build.sh -a deepep
```
A2
```
bash build.sh -a deepep2
```

Installation

Pip install the .whl file into your Python environment

pip install output/deep_ep*.whl

# Link to the deep_ep_cpp.*.so file
cd "$(pip show deep-ep | grep -E '^Location:' | awk '{print $2}')" && ln -s deep_ep/deep_ep_cpp*.so && cd -

# (Optional) Confirm whether the import can be successful
python -c "import deep_ep; print(deep_ep.__path__)"

Execute the environment variables for CANN (modify according to the installation path)

source /usr/local/Ascend/ascend-toolkit/set_env.sh

In the Python project, import deep_ep

import deep_ep

Architecture

DeepEP-Ascend employs a strategy-based architecture where communication implementations are abstracted into interchangeable strategies, selected via environment variables.

Core Components

Component	File	Description
Buffer	`buffer.py`	Main entry point. Initializes communication buffer and delegates to strategy objects.
NormalStrategy	`ep_strategy.py` / `strategies/normal_strategy.py`	Normal mode dispatch/combine strategies (default, alltoall).
LowLatencyStrategy	`ep_strategy.py` / `strategies/low_latency_strategy.py`	Low-latency mode dispatch/combine strategies (default, ops, alltoall).
EventOverlap	`utils.py`	Event synchronization utility for async operations.
FuseMode	`buffer.py`	Enum for fused MoE computation modes.

Strategy Selection

Strategies are configured via environment variables at Buffer initialization:

Environment Variable	Value	Normal Strategy	Low-Latency Strategy
`DEEP_USE_MODE=default`	default	`DefaultNormalCommStrategy` (deep_ep_cpp custom ops)	`DefaultLowLatencyCommStrategy` (deep_ep_cpp custom ops)
`DEEP_USE_MODE=alltoall`	alltoall	`AlltoAllNormalCommStrategy` (torch.distributed alltoallv)	`AllToAllLowLatencyCommStrategy` (torch.distributed alltoall)
`DEEP_USE_MODE=default`	ops	`DefaultNormalCommStrategy` (deep_ep_cpp custom ops)	`OpsLowLatencyCommStrategy` (torch_npu ops)

Note: Invalid env (e.g., DEEP_USE_MODE=error) will raise a ValueError.

API Overview

The Buffer class is the primary interface. Below is a summary of the core APIs:

API	Mode	Description
`Buffer(group, num_nvl_bytes, num_rdma_bytes, ...)`	—	Initialize communication buffer with strategy selection.
`get_dispatch_layout(topk_idx, num_experts, ...)`	Normal	Calculate layout for subsequent dispatch. Returns `num_tokens_per_rank`, `num_tokens_per_rdma_rank`, `num_tokens_per_expert`, `is_token_in_rank`.
`dispatch(x, topk_idx, topk_weights, ...)`	Normal	Dispatch tokens to expert ranks. Returns received tokens, topk info, and a handle for combine.
`combine(x, handle, ...)`	Normal	Combine (reduce) tokens from dispatch. Must use the handle returned by `dispatch`.
`low_latency_dispatch(x, topk_idx, num_max_dispatch_tokens_per_rank, num_experts, ...)`	Low-Latency	Low-latency token dispatch for decode phase.
`low_latency_combine(x, topk_idx, topk_weights, handle, ...)`	Low-Latency	Low-latency token combine for decode phase.
`fused_deep_moe(x, topk_idx, topk_weights, ...)`	Fused	Fused dispatch + FFN + combine in a single call.
`get_dispatch_config(num_ranks)`	—	Get recommended Config for normal dispatch.
`get_combine_config(num_ranks)`	—	Get recommended Config for normal combine.
`clean_low_latency_buffer(...)`	—	Clean buffer before switching from normal to low-latency mode.

For detailed API documentation, see:

Communication Modes

Normal Mode (Prefill / Training)

High-throughput dispatch and combine for training and prefill phases:

A3: Pure HCCS intranode communication, full-mesh HCCS internode communication. No hierarchical implementation needed.
A2 Intranode: Pure HCCS communication, supports up to bs=8000 for normal dispatch/combine.
A2 Internode: Hierarchical (HCCS intranode + RDMA internode) or non-hierarchical (pure RDMA) implementation. Supports up to bs=4096.

Low-Latency Mode (Decode)

Optimized for inference with small batch sizes (128 tokens/batch):

A3: Supports default, ops, and alltoall strategies. ops strategy supports comm_alg options: hierarchy, fullmesh_v1, fullmesh_v2, ccu.
A2 Intranode: Supports up to bs=512 for low_latency dispatch/combine.
A2 Internode: Hierarchical (HCCS + RDMA) or non-hierarchical (pure RDMA) implementation. Supports up to bs=512.

Fused MoE

The fused_deep_moe API fuses dispatch + expert FFN computation + combine into a single operator call, significantly reducing communication overhead and end-to-end latency.

Two fuse modes are available via the FuseMode enum:

FuseMode.FUSED_DEEP_MOE (default): Full fusion of dispatch + FFN + combine.
FuseMode.DISPATCH_FFN_COMBINE: Dispatch + FFN + combine with separate dispatch handling.

Quantization modes (quant_mode):

1: INT8 quantization (default)
FP8 will be supported in A5 release.

See Fused Deep MoE API for details.

Environment Variables

Variable	Default	Description
`DEEP_USE_MODE`	`default`	Normal mode strategy and Low-latency mode strategy: `default`, `ops`, or `alltoall`.
`DEEP_NORMAL_MODE_USE_INT8_QUANT`	`0`	Enable INT8 quantization in normal dispatch. A2 internode does NOT support quantization in normal mode.
`SGLANG_DEEPEP_BF16_DISPATCH`	`0`	Disable quantization in low_latency_dispatch (BF16 dispatch). Set to `1` to disable; only effective in decode phase.
`MOE_EXPERT_TOKEN_NUMS_TYPE`	`1`	Dispatch return type for `num_recv_tokens_per_expert_list`: `1` = per-expert token count, `0` = prefix sum.
`MOE_SHARED_EXPERT_RANK_NUM`	`0`	Number of shared expert ranks (used by ops strategy).
`HCCL_BUFFSIZE`	`200` (MB)	HCCL buffer size in MB. Must be set when using DeepEP on A2.
`HCCL_INTRA_PCIE_ENABLE`	`0`	Set to `1` for A2 dual-node hierarchical communication.
`HCCL_INTRA_ROCE_ENABLE`	`1`	Set to `0` for A2 dual-node hierarchical communication.
`HCCL_OP_EXPANSION_MODE`	—	Must be disabled on A2 when using DeepEP (remove or unset this variable).
`DEBUG_MODE`	`OFF`	Set to `ON` to enable DEBUG logging for parameter tracing.

Platform-Specific Notes

A2 Single Node

Applicable when P/D node ranks = 8 (supports PD separation or mixed deployment).
Not recommended when ranks < 8 (insufficient parallelism for EP benefits).
Performance limits: normal up to bs=8000, low_latency up to bs=512.
Must set HCCL_BUFFSIZE (e.g., export HCCL_BUFFSIZE=1024).
Must disable HCCL_OP_EXPANSION_MODE.

For detailed A2 usage, see A2_DEEPEP_CN.md.

A2 Dual Node

Applicable when P/D node ranks > 8 (cross-node communication).
Normal mode does NOT support quantization (DEEP_NORMAL_MODE_USE_INT8_QUANT=0).
Must set HCCL_INTRA_PCIE_ENABLE=1 and HCCL_INTRA_ROCE_ENABLE=0 for hierarchical communication.
Performance limits: normal up to bs=4096, low_latency up to bs=512.

A3

Pure HCCS communication for both intranode and internode. No hierarchical implementation needed.
Supports ops strategy with multiple comm_alg options for low-latency mode.

A5

Only supports CANN 9.0.
Build with: bash build.sh -a deepep Ascend950.

Test

Execute DeepEP-related test scripts:

python3 tests/python/deepep/test_fused_deep_moe.py
python3 tests/python/deepep/test_intranode.py
python3 tests/python/deepep/test_low_latency.py

# A2 single-node tests
python3 tests/python/deepep/test_intranode.py --num-processes=8
python3 tests/python/deepep/test_low_latency.py --num-processes=8
python3 tests/python/deepep/test_normal_and_low_latency.py --num-processes=8

# A2 dual-node internode test (set primary node IP in run_test_internode.sh first)
bash tests/python/deepep/run_test_internode.sh

FAQ

If installing the .whl file results in the inability to import deep_ep in the project, check whether it is correctly installed in the site-packages directory of the current Python environment:

pip show deep-ep

If after installing the .whl, you encounter an issue where deep_ep_cpp is not found, you need to create a symbolic link of the deep_ep_cpp*.so files from the site-packages/deep_ep directory to the site-packages directory. Execute the following command in the site-packages directory:

ln -s deep_ep/deep_ep_cpp*.so

If you get a ValueError about unsupported mode combination, check that DEEP_NORMAL_MODE and DEEP_LOW_LATENCY_MODE are a valid pair. See the Strategy Selection table for valid combinations.
On A2, always set HCCL_BUFFSIZE before running DeepEP. Missing this will cause dispatch/combine operators to fail.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DeepEP-Ascend

Introduction

Software and Hardware

Quick Start

Compile and Build

Installation

Architecture

Core Components

Strategy Selection

API Overview

Communication Modes

Normal Mode (Prefill / Training)

Low-Latency Mode (Decode)

Fused MoE

Environment Variables

Platform-Specific Notes

A2 Single Node

A2 Dual Node

A3

A5

Test

FAQ

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

DeepEP-Ascend

Introduction

Software and Hardware

Quick Start

Compile and Build

Installation

Architecture

Core Components

Strategy Selection

API Overview

Communication Modes

Normal Mode (Prefill / Training)

Low-Latency Mode (Decode)

Fused MoE

Environment Variables

Platform-Specific Notes

A2 Single Node

A2 Dual Node

A3

A5

Test

FAQ